Performance of T2-based PCA mix control chart with KDE control limit for monitoring variable and attribute characteristics

This work presents a detailed performance evaluation of the mixed multivariate $T^2$ control chart based on PCA Mix. The control limit of the proposed chart is calculated using kernel density estimation (KDE). Through simulation studies, the chart's performance is assessed in terms of its ability to identify outliers and process shifts. The proposed chart provides a stable accuracy rate for identifying mixed outliers when up to 30% of the data are outliers. For balanced proportions of the attribute characteristics, misdetection happens because of the high false alarm rate. For imbalanced and extremely imbalanced proportions, the masking effect is the key issue. The proposed chart also shows improved performance in identifying shifts in the process.

Statistical process control (SPC) is a statistical methodology for monitoring and controlling the variation of a process to ensure that it produces products that meet customer requirements. A control chart, which is part of SPC, is one of the tools often used to monitor the quality of a company's products and services 1. Based on the number of monitored quality characteristics, control charts are divided into two types: univariate and multivariate. Univariate control charts monitor only one quality characteristic, while multivariate control charts monitor more than one quality characteristic.
In the current Industry 4.0 era, a process can rarely be monitored through only one type of quality characteristic. For example, variable control charts are used to monitor variable characteristics (measured on a numerical scale, such as height or weight), whereas attribute control charts are employed to monitor categorical or attribute data (such as color or hardness) 2. Monitoring mixed quality characteristics in the manufacturing process is important 3. However, in the past, mixed quality characteristics were commonly monitored with separate charts. This is inefficient because two statistics and two control limits must be calculated. Moreover, the practitioner will have difficulty interpreting the monitoring result if the two procedures yield different results. Therefore, a procedure for jointly monitoring mixed characteristics is needed.
Ahsan et al. 4 proposed a new monitoring procedure based on the PCA Mix algorithm to overcome this issue. This work was also extended to detecting outliers for various proportions of contaminating outliers 5. In this method, the $T^2$ statistic is used to form the control chart. Because the distribution of the statistic is unknown, the control limit of the PCA Mix chart is estimated using kernel density estimation, a non-parametric method for estimating the density of an unknown distribution 6. However, in those works, the outlier-detection performance of the PCA Mix chart is evaluated for only one attribute characteristic. Additionally, the effectiveness of the PCA Mix chart in identifying a change in the process is evaluated only when both the variable and attribute characteristics shift. As a result, there is no guidance on which type of shift the chart detects best.
For these reasons, this work evaluates in detail the performance of the PCA Mix chart in detecting outliers and shifts in the process. Like the PCA Mix chart proposed by Ahsan et al. 4, the proposed chart employs kernel density estimation (KDE) to calculate the control limit. The proposed chart is evaluated for more than one attribute characteristic when detecting outliers. When monitoring process changes, it is evaluated for different kinds of shift and correlation. This work also shows how the proposed chart is used to monitor real data and how its performance compares with that of other charts.
The remainder of this work is structured as follows. Section "Related works" reviews related research. Section "PCA mix" describes the PCA Mix method, and Sect. "Charting procedures" presents the charting procedure of the proposed method. Sections "Performance in detecting outlier" and "Performance evaluation in monitoring process shift" present the performance assessments for identifying outliers and process shifts. Section "Application in the real cases" illustrates how the proposed chart is used to monitor real datasets. Section "Conclusions" summarizes the conclusions.

Related works
Recent advancements in control charts are discussed in this section, which covers three categories of control charts: multivariate variable charts, attribute charts, and mixed charts. For multivariate variable charts, the emphasis is on three chart types: Hotelling's $T^2$, multivariate EWMA, and multivariate CUSUM. Their recent developments are summarized in Table 1. Table 2 lists the most recent works on attribute charts. The table shows that current research has mostly concentrated on attribute charts using fuzzy, Poisson, and multinomial data.
Additionally, Table 3 displays the most recent developments in mixed control charts. It is clear that only a few works have looked at jointly monitoring variable and attribute characteristics. Consequently, additional advancement in this field is required. To improve the monitoring technique, this research aims to build and evaluate the performance of a mixed-type chart, namely the PCA Mix control chart.

PCA mix
Multivariate data analysis is a statistical approach for examining data that include two or more quality characteristics. These characteristics may be either variable (interval- or ratio-scaled) or attribute (categorical). Principal component analysis (PCA) is a statistical technique used to reduce the dimension of continuous data, known as variable characteristics in statistical process control (SPC). Multiple correspondence analysis (MCA), an extension of correspondence analysis (CA), examines the relationships among several correlated categorical variables, known as attribute characteristics in SPC. When the observations are categorical, MCA may be thought of as an extension of the PCA approach 35. Thus, the PCA Mix method is a combination of PCA and MCA that can handle both types of quality characteristics together.
In this study, the PCA Mix technique is implemented following the approach of Chavent et al. 36. Let the $n \times p$ matrix $X_1$ and the $n \times q$ matrix $X_2$ contain the variable and attribute characteristics, respectively, where $n$ is the number of observations, $p$ is the number of variable characteristics, and $q$ is the number of attribute characteristics. An indicator matrix $G$ of dimension $n \times m$ provides the binary coding of the levels of each attribute characteristic, where $m$ is the total number of levels over all attribute characteristics. The $n \times (p + m)$ real-valued matrix $Z = [Z_1, Z_2]$ is then formed, where $Z_1$ and $Z_2$ are the centred versions of $X_1$ and $G$. The rows of $Z$ are weighted by $N = \frac{1}{n} I_n$ and its columns by $M = \mathrm{diag}\left(1, \ldots, 1, \frac{n}{n_1}, \ldots, \frac{n}{n_m}\right)$: the first $p$ columns of $Z$ are weighted by 1, and the last $m$ columns are weighted by $n/n_s$ for $s = 1, 2, \ldots, m$, where $n_s$ is the number of observations in level $s$.

The next step is solving the eigenvalue problem of $Z$ using the Generalized Singular Value Decomposition (GSVD) in Chavent et al. 36:

$$Z = U \Lambda V^{\top},$$

where $\Lambda = \mathrm{diag}\left(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_r}\right)$, $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the eigenvalues of $Z$, and $r$ denotes the rank of $Z$. The $n \times r$ matrix $U$ and the $(p + m) \times r$ matrix $V$ contain the eigenvectors of $Z$ and satisfy $U^{\top} N U = V^{\top} M V = I_r$. As a result, the principal components of PCA Mix may be calculated as

$$Y_{mix} = Z M V,$$

with size $n \times r$.
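The decomposition described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `pca_mix_scores` is hypothetical, the attribute levels are assumed to be supplied as integer codes, the variable characteristics are assumed to be scaled to unit variance as well as centred (as in Chavent et al.), and the GSVD is obtained from an ordinary SVD of $N^{1/2} Z M^{1/2}$.

```python
import numpy as np

def pca_mix_scores(X1, X2):
    """Return (scores, eigenvalues) of a PCA Mix decomposition (sketch).

    X1: (n, p) array of variable characteristics.
    X2: (n, q) array of attribute levels coded as integers (assumption).
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2)
    n, p = X1.shape

    # Z1: centred variable characteristics; scaling to unit variance is an
    # assumption taken from Chavent et al., not stated in the text above.
    Z1 = (X1 - X1.mean(axis=0)) / X1.std(axis=0)

    # G: indicator (one-hot) matrix with one column per observed level.
    G = np.hstack([(X2[:, [j]] == np.unique(X2[:, j])).astype(float)
                   for j in range(X2.shape[1])])
    n_s = G.sum(axis=0)            # number of observations in each level
    Z2 = G - G.mean(axis=0)        # centred indicator matrix
    Z = np.hstack([Z1, Z2])

    # Column weights M = diag(1, ..., 1, n/n_1, ..., n/n_m); row weights N = I/n.
    m_diag = np.concatenate([np.ones(p), n / n_s])

    # GSVD through the ordinary SVD of N^{1/2} Z M^{1/2}.
    Z_tilde = Z * np.sqrt(m_diag) / np.sqrt(n)
    _, sing, Vt = np.linalg.svd(Z_tilde, full_matrices=False)
    r = int(np.sum(sing > 1e-10 * sing[0]))      # numerical rank of Z
    V = Vt[:r].T / np.sqrt(m_diag)[:, None]      # V = M^{-1/2} V_tilde

    scores = Z @ (m_diag[:, None] * V)           # Y_mix = Z M V
    return scores, sing[:r] ** 2
```

With this normalisation each column of the score matrix has mean zero and variance equal to its eigenvalue, which is the property the $T^2$ statistic relies on.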

Charting procedures
The steps for building a multivariate control chart based on PCA Mix are covered in this section and shown in Fig. 1. There are three basic phases in the process. First, the PCs are calculated from the combined characteristics using PCA Mix. Second, the $T^2$ statistics are computed using the selected principal components. Finally, KDE is used to estimate the control limit of the proposed chart.
Reference | Method | Main finding
24 | Attribute chart for monitoring of mean and variance | Comparing the suggested method to the conventional approach, the new way is simpler to implement
Aldosari et al. 25 | Multiple dependent state repetitive sampling (MDSRS) | The suggested technique performs better than the traditional strategy based on repetitive sampling
Aslam et al. 26 | Shewhart neutrosophic attributes chart | The suggested attribute control chart is effective at identifying changes in the process
Chong et al. 27 | Multi-attribute CUSUM-np chart | The proposed method performs as well as or better than the traditional chart
Aslam 28 | Attribute chart with the repetitive sampling using the neutrosophic approach | Compared to the current chart, the suggested chart with repetitive sampling under the neutrosophic system is better capable of detecting a change in the process
Wibawati et al. 29 | Fuzzy multinomial (FM) chart | FM chart is capable of detecting shifts

www.nature.com/scientificreports/

PCA mix control chart's procedures
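Step 1 Input the variable data $X_1$ and the attribute data $X_2$.
Step 2 Calculate the mixed principal component scores (PCs), denoted as $Y_{mix}$, using the PCA Mix method from $X_1$ and $X_2$.
Step 3 Take the first $v$ components and calculate $T^2_i = \sum_{j=1}^{v} \frac{y_{ij}^2}{\lambda_j}$, where $y_{ij}$ is the score of the $i$-th observation on the $j$-th PC and $\lambda_j$ is the eigenvalue for the $j$-th PC.
Step 4 Calculate the empirical density of the $T^2_i$ statistics, $\hat{f}_h(t) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{t - T^2_i}{h}\right)$, where $K$ is the kernel function and $h$ is the optimum bandwidth calculated using the Botev, Grotowski, and Kroese algorithm 37.
Step 5 Calculate the distribution function of the $T^2_i$ statistics, $\hat{F}_h(t) = \int_{-\infty}^{t} \hat{f}_h(u)\, du$.
Step 6 Calculate the KDE control limit $CL = \hat{F}_h^{-1}(1 - \alpha)$ when the process is in control.
Step 7 Plot the $T^2_i$ along with the KDE control limit $CL$ to form the PCA Mix control chart.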
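Steps 3 to 6 can be illustrated with the following sketch. It is written under stated assumptions rather than as the authors' code: a Gaussian kernel is assumed, Silverman's rule-of-thumb bandwidth is used as a simple stand-in for the Botev, Grotowski, and Kroese bandwidth, the inverse distribution function is found by bisection, and all function names are hypothetical.

```python
import math
import numpy as np

def t2_statistics(scores, eigenvalues, v):
    """T2_i = sum over the first v components of y_ij^2 / lambda_j."""
    scores = np.asarray(scores, dtype=float)
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return np.sum(scores[:, :v] ** 2 / eigenvalues[:v], axis=1)

def kde_cdf(t, sample, h):
    """Distribution function of a Gaussian-kernel density estimate at t."""
    z = (t - sample) / (h * math.sqrt(2.0))
    return float(np.mean([0.5 * (1.0 + math.erf(zi)) for zi in z]))

def kde_control_limit(t2, alpha=0.00273):
    """Upper control limit CL with estimated F(CL) = 1 - alpha."""
    t2 = np.asarray(t2, dtype=float)
    n = t2.size
    # Silverman's rule of thumb: an assumption replacing the Botev et al.
    # bandwidth selector named in Step 4.
    q75, q25 = np.percentile(t2, [75, 25])
    h = 0.9 * min(t2.std(ddof=1), (q75 - q25) / 1.34) * n ** (-0.2)

    # Bisection on the monotone KDE distribution function.
    lo, hi = t2.min() - 10.0 * h, t2.max() + 10.0 * h
    target = 1.0 - alpha
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, t2, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

An observation would then be flagged as out of control when its $T^2_i$ exceeds the returned limit (Step 7).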

Performance in detecting outlier
The effectiveness of the proposed chart in identifying outliers mingled with the in-control data is demonstrated in this section. Simulation studies involving various situations are carried out to evaluate its performance.
For the detailed performance evaluation, 2, 3, and 5 attribute characteristics are considered, together with 5 variable characteristics and $n = 1000$ observations. The proportion of outliers mixed with the clean data is set to 5, 10, 20, 30, 40, and 50 percent of the total observations. The proposed chart's accuracy may be assessed using the confusion matrix, which categorizes the results into four groups: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Observations correctly identified as outliers and as non-outliers are denoted by TP and TN, respectively, whereas observations wrongly identified as outliers and as non-outliers are denoted by FP and FN. The accuracy level employed is the hit rate (HR), which can be computed using Eq. (4):

$$HR = \frac{TP + TN}{TP + TN + FP + FN}. \quad (4)$$

The error rate in a confusion matrix may be subdivided into the false positive rate (FPR) and the false negative rate (FNR). The FPR is the proportion of in-control observations that are wrongly labeled as positive, whereas the FNR is the proportion of outliers that are incorrectly classed as negative. The FPR and FNR are determined using Eqs. (5) and (6), respectively:

$$FPR = \frac{FP}{FP + TN}, \quad (5)$$

$$FNR = \frac{FN}{FN + TP}. \quad (6)$$

The detailed algorithm for the simulation studies can be found in Ahsan et al. 5.
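As a small worked illustration of these measures, the sketch below computes the hit rate, FPR, and FNR from true outlier labels and chart signals. It assumes the conventional confusion-matrix definitions (HR as overall accuracy, FPR relative to the in-control observations, FNR relative to the true outliers); the function name `detection_rates` is hypothetical.

```python
def detection_rates(is_outlier, flagged):
    """Return (hit rate, FPR, FNR) from true labels and chart signals.

    is_outlier, flagged: equal-length sequences of truthy/falsy values.
    Assumes at least one true outlier and one in-control observation.
    """
    pairs = list(zip(is_outlier, flagged))
    tp = sum(1 for truth, flag in pairs if truth and flag)
    tn = sum(1 for truth, flag in pairs if not truth and not flag)
    fp = sum(1 for truth, flag in pairs if not truth and flag)
    fn = sum(1 for truth, flag in pairs if truth and not flag)

    hit_rate = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)   # in-control observations wrongly signalled
    fnr = fn / (fn + tp)   # outliers that were missed
    return hit_rate, fpr, fnr

# Five observations: two true outliers, one of them detected,
# and one false alarm among the three in-control observations.
hr, fpr, fnr = detection_rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

In this toy case TP = 1, TN = 2, FP = 1, and FN = 1, so the hit rate is 3/5, the FPR is 1/3, and the FNR is 1/2.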

Two attribute characteristics
Table 4 shows the performance of the proposed chart in detecting outliers for two attribute characteristics with $\theta_1 = \theta_2 = 0.3$ and $\theta_3 = 0.4$. In general, the proposed chart maintains stable performance when no more than 30 percent of outliers are added to the clean data. In this case, misdetection occurs because a large number of in-control observations are declared as outliers (high FP rate). The performance of the proposed chart in detecting outliers for two attribute characteristics with imbalanced proportions is reported in Table 5. Unlike the previous case (two attribute characteristics with balanced proportions), the misdetections are caused by the inability of the control chart to capture the actual outliers, which can be seen from the high FN rate. Furthermore, Table 6 reports the performance of the proposed chart in detecting outliers for the extremely imbalanced proportions ($\theta_1 = \theta_2 = 0.05$ and $\theta_3 = 0.9$). In this condition, the high FN rate causes a low level of accuracy in the proposed chart. In general, using $l = 2$ components produces better results for this case.

Three attribute characteristics
The performance of the proposed chart in outlier detection for three attribute characteristics with balanced proportions ($\theta_1 = \theta_2 = 0.3$ and $\theta_3 = 0.4$) is presented in Table 7. As in the case of two attribute characteristics, misdetection happens because of the high number of false alarms, represented by the high FP rate. Tables 8 and 9 show the performance for three attribute characteristics with imbalanced and extremely imbalanced proportions, respectively. In these two cases, misdetection happens because the actual outliers fail to be detected, as represented by the high FN rate. It can also be seen that using fewer principal components produces better results. The performance degrades when the data contain more than 30 percent outliers. Also, the more imbalanced the proportions of the attribute characteristics, the higher the accuracy level produced.

Five attribute characteristics
Table 10 shows the outlier monitoring results for five attribute characteristics with $\theta_1 = \theta_2 = 0.3$ and $\theta_3 = 0.4$. According to the simulation results, misdetection in this case occurs because a large number of in-control observations are declared as outliers (see the FP rate). The performances of the proposed chart for $\theta_1 = \theta_2 = 0.1$ and $\theta_3 = 0.8$ as well as for $\theta_1 = \theta_2 = 0.05$ and $\theta_3 = 0.9$ are reported in Tables 11 and 12, respectively. As in the two previous cases, the failure to detect the actual outliers reduces the accuracy of the proposed chart. In general, using fewer principal components leads to higher accuracy. The chart maintains its peak performance when fewer than 40 percent of the observations are outliers. Moreover, the more imbalanced the proportions of the attribute characteristics monitored by the proposed chart, the higher the hit rate or accuracy produced.
Based on the simulation results on the performance of the proposed chart in detecting outliers, the findings can be summarized as follows:
1. In general, the proposed chart performs well only when the data contain up to 30 percent outliers.
2. When used to monitor attribute characteristics with balanced proportions, the chart's performance decreases due to high false alarms, or swamping effects.
3. When used to monitor attribute characteristics with imbalanced and extremely imbalanced proportions, the chart's performance decreases due to high false negatives, or masking effects.
4. The proposed chart is suitable for monitoring outliers in attribute data with imbalanced and extremely imbalanced proportions.

Performance evaluation in monitoring process shift
Table 11. Performance of the proposed chart in identifying outliers for five attribute characteristics with $\theta_1 = \theta_2 = 0.1$ and $\theta_3 = 0.8$.

This part evaluates the proposed chart's effectiveness in detecting a process shift. As in the preceding part, attribute characteristics are generated from a multinomial distribution with three different types of proportions, and variable characteristics are generated from a multivariate normal distribution. Here, the performance of the proposed chart is assessed for several types of shift: a shift in the variable characteristics only, a shift in the attribute characteristics only, and a shift in both the variable and attribute characteristics. Different correlation levels are also tested to see how well the proposed chart performs. Using the same approach as Ahsan et al. 4, the $ARL_1$ is estimated by shifting the variable characteristics by $\mu_{shift} = \mu + \delta_\mu$, where $\delta_\mu = 0.1$, and by shifting the attribute proportions with $\delta_\theta = 0.0025$.
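The data generation described in this section can be sketched as follows. Several details are assumptions made only for illustration: the in-control mean vector is taken as zero with unit variances, the variable characteristics share a common correlation `rho`, each attribute characteristic is drawn independently with the same proportions, and only the mean shift of the variable characteristics is implemented because the exact form of the attribute shift is not reproduced here. The function name `simulate_mixed` is hypothetical.

```python
import numpy as np

def simulate_mixed(n, p, theta, q=1, mean_shift=0.0, rho=0.0, seed=None):
    """Generate n observations of p variable and q attribute characteristics.

    theta: level proportions of each attribute, e.g. [0.3, 0.3, 0.4].
    mean_shift: amount added to every component of the in-control mean.
    rho: common correlation between variable characteristics (assumption).
    """
    rng = np.random.default_rng(seed)

    # Equicorrelated covariance matrix with unit variances.
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    mean = np.zeros(p) + mean_shift
    X1 = rng.multivariate_normal(mean, cov, size=n)

    # One multinomial trial per observation and attribute: level codes 0..k-1.
    X2 = rng.choice(len(theta), size=(n, q), p=theta)
    return X1, X2

# In-control data and data with a mean shift of 0.1 in the variables.
X1_ic, X2_ic = simulate_mixed(1000, 5, [0.3, 0.3, 0.4], q=2, seed=1)
X1_oc, X2_oc = simulate_mixed(1000, 5, [0.3, 0.3, 0.4], q=2, mean_shift=0.1, seed=2)
```

Passing `[0.1, 0.1, 0.8]` or `[0.05, 0.05, 0.9]` as `theta` gives the imbalanced and extremely imbalanced settings used in the tables.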

Shift in variable characteristics
The proposed chart's performance is shown in Tables 13, 14 and 15 for the balanced, imbalanced, and extremely imbalanced proportions of attribute data, respectively. In general, using the KDE control limit, the proposed chart produces an $ARL_0$ of around 370 for the false alarm rate $\alpha = 0.00273$. For a shift in only the variable characteristics, the proposed chart captures the change in the process, producing a lower $ARL_1$ for larger shifts. In this case, better performance is achieved when the attribute characteristics have balanced proportions.
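The link between the false alarm rate and the in-control ARL can be checked with a short sketch. It assumes that successive observations signal independently with a constant probability, so that the run length is geometric and the ARL equals the reciprocal of the signal probability; the function names are hypothetical.

```python
import random

def theoretical_arl(signal_prob):
    """ARL of a chart whose points signal independently with signal_prob."""
    return 1.0 / signal_prob

def simulated_arl(signal_prob, n_runs=5000, seed=0):
    """Monte Carlo ARL: average number of points until the first signal."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        run_length = 1
        # Keep drawing points until one falls beyond the control limit.
        while rng.random() >= signal_prob:
            run_length += 1
        total += run_length
    return total / n_runs

arl0 = theoretical_arl(0.00273)      # about 366.3, i.e. "around 370"
arl0_mc = simulated_arl(0.00273)     # Monte Carlo estimate of the same value
```

The same functions give an out-of-control ARL when `signal_prob` is replaced by the probability that a shifted observation exceeds the control limit.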

Shift in attribute characteristics
The performances of the proposed chart under a shift in the attribute characteristics for balanced, imbalanced, and extremely imbalanced proportions are presented in Tables 16, 17 and 18, respectively. In this case, using the KDE control limit, the performance of the proposed chart in the in-control state is stable (the $ARL_0$ is around 370 in all scenarios for $\alpha = 0.00273$). Although the proposed chart can capture process shifts in the attribute characteristics, the $ARL_1$ does not drop as sharply as when detecting a shift in the variable characteristics. Also, the proposed chart performs better than the existing chart, particularly when dealing with highly imbalanced data.

Shift in variable and attribute characteristics
This subsection presents the performance of the proposed chart in detecting a shift in both the variable and attribute characteristics. Table 19 reports the performance of the proposed chart for balanced proportions of the attribute characteristics. Meanwhile, the chart's performances for the imbalanced and extremely imbalanced proportions are presented in Tables 20 and 21. The results show that the performance is similar to that obtained when the chart monitors shifts in the variable characteristics only. The main difference in performance lies in the size of the shift. For small shifts, the proposed chart better monitors a shift in only the variable characteristics. On the other hand, a shift in both the variable and attribute characteristics produces better performance for large shifts.

Table 15. ARLs for $\theta_1 = \theta_2 = 0.05$ and $\theta_3 = 0.9$ with shift in the variable characteristics for $p = 5$. $ARL_0$ is in bold.

Different correlation
This subsection presents the performance of the proposed chart for several correlation coefficients. In evaluating the performance of the proposed chart, the variable characteristics are generated with four correlation coefficients (0.3, 0.5, 0.7, and 0.9), and the control limit is obtained using KDE. In this case, the process is shifted in both the variable and attribute characteristics. The number of variable characteristics $p$ is 5, and the number of principal components used $l$ is 4. Also, the proposed chart is evaluated for the three types of attribute proportions described in the previous section. Table 22 shows the performance of the proposed chart for monitoring the balanced proportion of attribute characteristics ($\theta_1 = \theta_2 = 0.3$ and $\theta_3 = 0.4$) with several correlation coefficients. In the in-control condition, the proposed chart produces an $ARL_0$ of about 370 in all scenarios. The proposed chart can detect a shift in the process, as shown by the smaller $ARL_1$. In this case, better performance is achieved when the proposed chart monitors a process with a smaller correlation coefficient.

Table 18. ARLs for $\theta_1 = \theta_2 = 0.05$ and $\theta_3 = 0.9$ with shift in the attribute characteristics for $p = 5$. $ARL_0$ is in bold.

Tables 23 and 24 report the proposed chart's performance in monitoring the imbalanced and extremely imbalanced proportions of the attribute characteristics. According to the tables, in the in-control condition the proposed chart produces the appropriate $ARL_0$ (around 370 for $\alpha = 0.00273$). As in the previous result, a smaller correlation coefficient produces better performance, as seen from the $ARL_1$ value for each scenario. In addition, the proposed chart reaches its peak performance when it monitors data with balanced proportions of the attribute characteristics.
Based on the simulation results on the performance of the proposed chart in monitoring process shifts, the findings can be summarized as follows:

Machine failure dataset
This section describes how the proposed chart is applied to a real-world scenario. The proposed chart is used to monitor the machine failure dataset (attached as an Excel file). This dataset has been used in Ref. 4. There are 8784 samples in this dataset, along with 16 variable characteristics and 4 attribute characteristics, one of which labels the observations. In this study, 8 of the 16 variable characteristics and 2 of the 3 remaining attribute characteristics are chosen based on their mean deviation from the mean of the in-control process. The first attribute characteristic has eight categories and the second has four, both with balanced proportions. Table 25 shows the performance of the proposed chart in monitoring the machine failure dataset. According to the table, the multivariate chart based on PCA Mix outperforms the conventional $T^2$ chart. The PCA Mix chart with the F-distribution control limit has a slightly better hit rate. However, the proposed chart performs better than the other charts in detecting the actual out-of-control observations. This happens because the two attribute characteristics, which have balanced proportions, increase the accuracy of the proposed method.

NSL-KDD dataset
In this section, the well-known NSL-KDD dataset (available at https://www.kaggle.com/datasets/hassan06/nslkdd) is monitored using the proposed chart. It is regarded as a typical benchmark for assessing intrusion detection 38. Table 26 details the proposed chart's effectiveness in monitoring the NSL-KDD dataset. Based on the findings, the proposed chart performs better than the other charts. The proposed chart, which uses the KDE control limit, yields the highest hit rate and the lowest false positive rate.

Conclusions
This paper presents a detailed performance evaluation of the PCA Mix control chart in monitoring mixed variable and attribute quality characteristics. Through simulation studies covering several cases, the evaluation shows the PCA Mix chart's ability to detect outliers and shifts in the process. The proposed chart maintains stable performance when no more than 30 percent of the observations are outliers. When the proposed chart monitors more than one attribute characteristic with balanced proportions, most misdetection beyond 30 percent outliers is due to false alarms. On the other hand, when it monitors attribute characteristics with imbalanced proportions, the proposed chart fails to detect actual outliers once the data contain more than 30 percent outliers. Furthermore, the performance of the proposed chart is evaluated in detecting a shift in the process. For small shifts, the proposed chart performs best when only the variable characteristics shift. For large shifts, it performs better when both the variable and attribute characteristics shift. The proposed chart also performs better when the correlation coefficient is smaller. In addition, the proposed chart is applied to monitor two datasets, and its performance is compared with the conventional method. The monitoring results show that, compared to the other charts, the proposed chart has higher detection accuracy, detecting more actual out-of-control observations with a low false alarm rate.
For future research, the proposed chart can be extended by using robust estimators of both the mean vector and the covariance matrix. The bootstrap resampling method can be used to estimate the control limit of the proposed chart. The Squared Prediction Error (SPE) or Q statistic can be employed as an alternative to Hotelling's $T^2$ statistic in monitoring the mixed characteristics. Also, the effect of autocorrelation in the metric data is an interesting issue that needs to be explored.
Reference | Method | Main finding
9 | control chart for high-dimensional data | The suggested approach may be used with great accuracy without any preprocessing or dimension reduction
Yenageh et al. 9 | Adaptive MEWMA approach for monitoring linear and logistic profiles | The proposed chart performs better in monitoring linear and logistic profiles

Table 2. Attribute chart's most recent advancement.

Table 3. Mixed chart's most recent advancement.
Reference | Method | Main finding
4 | PCA Mix chart | Comparing the proposed chart to other robust and traditional charts, it performs excellently in detecting more outliers with a larger percentage of outliers included
Ahsan et al. 4 | PCA Mix chart | When a suitable number of principal components are chosen, the suggested chart displays strong performance
Wang Su et al. 33 | Multivariate sign chart | Simulations demonstrate how effective the suggested control chart is in inspecting mixed-type data
Aslam Azam et al. 34 | Mixed chart | The mixed chart displays good monitoring process performance

Scientific Reports | (2024) 14:7372 | https://doi.org/10.1038/s41598-024-58052-4

Table 6. Performance of the proposed chart in identifying outliers for two attribute characteristics with $\theta_1 = \theta_2 = 0.05$ and $\theta_3 = 0.9$.

Table 25. Proposed chart performance in monitoring the machine failure dataset. Significant values are in bold.

Table 26. Proposed chart performance in monitoring the NSL-KDD dataset. Significant values are in bold.